The topic of Lecture 7 is Language; this post summarizes the lecture and the seventh assignment.

Course link: https://cs50.harvard.edu/ai/

Note: all images are taken from the course slides.

Natural Language Processing

Context-Free Grammar

A context-free grammar specifies the syntactic rules of a language, for example:

NP → N | D N
N → she | city | car | Harry | …
D → the | a | an | …
V → saw | ate | walked | …
P → to | on | over | …
ADJ → blue | busy | old | …

Applying a context-free grammar to a sentence produces a parse tree:

nltk
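
The lecture uses the nltk library to parse sentences against a context-free grammar. A minimal sketch (assuming nltk is installed; the S and VP rules are my additions, since the excerpt above lists only the lexical and NP rules):

import nltk

# grammar based on the rules above; the S and VP rules are assumed here
# so that complete sentences can be parsed
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> N | D N
    VP -> V | V NP
    N -> "she" | "city" | "car" | "harry"
    D -> "the" | "a" | "an"
    V -> "saw" | "ate" | "walked"
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("she saw the city".split()):
    tree.pretty_print()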

$n$-gram

A contiguous sequence of $n$ items from a text, e.g. $n$ consecutive words or characters.
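
A quick sketch of extracting word $n$-grams (nltk also provides nltk.ngrams for this):

def ngrams(tokens, n):
    """All contiguous runs of n items from `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("she saw the city".split(), 2))
# [('she', 'saw'), ('saw', 'the'), ('the', 'city')]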

tokenization

The task of splitting a sequence of characters into pieces (tokens); common cases are word tokens and sentence tokens.
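
For example, with nltk (assuming the punkt tokenizer data has been downloaded):

import nltk

text = "She saw the city. Harry walked to the car."
print(nltk.sent_tokenize(text))  # split into sentence tokens
print(nltk.word_tokenize(text))  # split into word tokens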

bag-of-words model

A model that represents text as an unordered collection of its words.
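
In the simplest case a bag of words is just a table of word counts, e.g. with the standard library:

from collections import Counter

bag = Counter("the cat saw the dog".split())
print(bag)  # Counter({'the': 2, 'cat': 1, 'saw': 1, 'dog': 1})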

Naive Bayes

Consider a text classification problem, for example deciding whether a sentence expresses positive or negative sentiment from the words it contains: we want $P(s \mid w_1, w_2, \ldots, w_n)$ for each class $s$.

Naive Bayes assumes the words are conditionally independent given the class, i.e.

$$P(s \mid w_1, \ldots, w_n) \propto P(s) \prod_{i=1}^{n} P(w_i \mid s)$$

where $s$ is the class label, $w_1, \ldots, w_n$ are the words, and the probabilities $P(s)$ and $P(w_i \mid s)$ are estimated from frequency counts in the training data.

additive smoothing (Laplace smoothing)

Note that the frequency-count estimates above produce many zeros (a single unseen word would zero out the whole product), so the estimates need to be smoothed. A common approach is to add $\alpha$ to each count:

$$P(w_i \mid s) = \frac{\mathrm{count}(w_i, s) + \alpha}{\sum_{w \in V} \mathrm{count}(w, s) + \alpha |V|}$$

where $V$ is the vocabulary. When $\alpha = 1$ this is Laplace smoothing.
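
A minimal sketch of the resulting classifier (the function and variable names here are mine, not from the course):

from collections import Counter

def naive_bayes(words, docs_by_class, alpha=1):
    """Score each class s by P(s) * product of P(w_i | s), using additive
    smoothing; `docs_by_class` maps a class label to tokenized documents."""
    vocab = {w for docs in docs_by_class.values() for doc in docs for w in doc}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    scores = {}
    for s, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values())
        score = len(docs) / total_docs  # prior P(s), estimated by frequency
        for w in words:
            # smoothed estimate of P(w | s)
            score *= (counts[w] + alpha) / (total + alpha * len(vocab))
        scores[s] = score
    return scores

print(naive_bayes(
    ["loved", "it"],
    {"positive": [["loved", "it"], ["great", "fun"]],
     "negative": [["hated", "it"]]},
))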

information retrieval

Information retrieval is the task of finding documents relevant to a user's query.

topic modeling

Topic modeling discovers the topics of a set of documents.

term frequency

Term frequency is the number of times a term appears in a document.

function words

Words that carry little meaning on their own but serve to connect other words grammatically, e.g. am, by, do, is, which, with, yet, …

content words

Words that carry meaning independently, e.g. algorithm, category, computer, …

inverse document frequency

Inverse document frequency measures how common or rare a word is across documents; it is computed as

$$\mathrm{idf}(w) = \log \frac{\text{TotalDocuments}}{\text{NumDocumentsContaining}(w)}$$

tf-idf

Ranks the important words in a document by multiplying term frequency (TF) by inverse document frequency (IDF): $\text{tf-idf}(w, d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)$.
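
A small sketch of the computation (it assumes `word` appears in at least one document, so the idf denominator is nonzero):

import math

def tf_idf(word, document, documents):
    """tf-idf of `word` in `document`, where `documents` is a list of
    tokenized documents and `document` is one of them."""
    tf = document.count(word)
    n_containing = sum(1 for doc in documents if word in doc)
    idf = math.log(len(documents) / n_containing)
    return tf * idf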

Word Representation

one-hot representation
  • he [1, 0, 0, 0]
  • wrote [0, 1, 0, 0]
  • a [0, 0, 1, 0]
  • book [0, 0, 0, 1]

A one-hot representation cannot express relationships between words (every pair of vectors is equally distant), and it does not scale: adding a new word changes the vector dimension, so every word must be re-encoded.

word2vec

word2vec exploits the idea that a word is strongly related to its context: a neural network is trained to predict a word from its surrounding words (or vice versa), and the learned vectors exhibit interesting regularities, such as the famous analogy king − man + woman ≈ queen.
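
A sketch of playing with pretrained vectors via gensim (the library, its downloader, and the "glove-wiki-gigaword-50" model name are assumptions on my part, not part of the course code; GloVe vectors behave like word2vec vectors for this demonstration):

import gensim.downloader as api

# downloads (on first use) and loads pretrained 50-dimensional GloVe vectors
vectors = api.load("glove-wiki-gigaword-50")

# vector arithmetic: king - man + woman should land near queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))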

Project

Parser

import nltk

def judge(word):
    """Return True if `word` contains at least one alphabetic character."""
    return any(s.isalpha() for s in word)

def preprocess(sentence):
    """
    Convert `sentence` to a list of its words.
    Pre-process sentence by converting all characters to lowercase
    and removing any word that does not contain at least one alphabetic
    character.
    """
    words = nltk.wordpunct_tokenize(sentence)
    res = []
    for word in words:
        word_lower = word.lower()
        if judge(word_lower):
            res.append(word_lower)
    
    return res

def np_chunk_helper(tree, chunks):
    if tree is None:
        return
    # collect all NP subtrees (tree.subtrees() includes tree itself)
    nps = [t for t in tree.subtrees() if t.label() == "NP"]
    # exactly one NP means it contains no other NP, so it is a chunk
    if len(nps) == 1:
        t = nps[0]
        # avoid duplicates
        if t not in chunks:
            chunks.append(t)
    elif len(nps) > 1:
        # more than one NP: recurse into the proper subtrees
        for t in tree.subtrees():
            # skip the tree itself to avoid infinite recursion
            if t != tree:
                np_chunk_helper(t, chunks)

def np_chunk(tree):
    """
    Return a list of all noun phrase chunks in the sentence tree.
    A noun phrase chunk is defined as any subtree of the sentence
    whose label is "NP" that does not itself contain any other
    noun phrases as subtrees.
    """
    chunks = []
    np_chunk_helper(tree, chunks)
    return chunks
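
A usage sketch tying the pieces together, reusing the toy grammar from the nltk example above rather than the project's full grammar (which is not shown in this excerpt):

import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> N | D N
    VP -> V | V NP
    N -> "she" | "city" | "car" | "harry"
    D -> "the" | "a" | "an"
    V -> "saw" | "ate" | "walked"
""")
parser = nltk.ChartParser(grammar)

for tree in parser.parse(preprocess("She saw the city.")):
    print(np_chunk(tree))  # prints the noun phrase chunks: she / the city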

Questions

import os
import string
import functools

import nltk
import numpy as np

def load_files(directory):
    """
    Given a directory name, return a dictionary mapping the filename of each
    `.txt` file inside that directory to the file's contents as a string.
    """
    res = dict()
    for filename in os.listdir(directory):
        # only `.txt` files, as the docstring specifies
        if not filename.endswith(".txt"):
            continue
        path = os.path.join(directory, filename)
        contents = ""
        with open(path, "r", encoding="UTF-8") as file:
            for line in file:
                contents += line.strip() + " "
        res[filename] = contents

    return res

def tokenize(document):
    """
    Given a document (represented as a string), return a list of all of the
    words in that document, in order.

    Process document by converting all words to lowercase, and removing any
    punctuation or English stopwords.
    """
    words = [t.lower() for t in nltk.word_tokenize(document)]
    # build the stopword set once instead of rebuilding it per word
    stopwords = set(nltk.corpus.stopwords.words("english"))
    return [
        word for word in words
        if word not in string.punctuation and word not in stopwords
    ]


def compute_idfs(documents):
    """
    Given a dictionary of `documents` that maps names of documents to a list
    of words, return a dictionary that maps words to their IDF values.

    Any word that appears in at least one of the documents should be in the
    resulting dictionary.
    """
    res = dict()
    n = len(documents)
    tmp_documents = dict()
    all_words = set()
    for filename in documents:
        tmp_documents[filename] = set(documents[filename])
        all_words.update(tmp_documents[filename])

    for word in all_words:
        cnt = 0
        for filename in tmp_documents:
            if word in tmp_documents[filename]:
                cnt += 1
        res[word] = np.log(n / cnt)
        
    return res
            
def compute_tf(word, words):
    """Return the term frequency: the number of occurrences of `word` in `words`."""
    return words.count(word)

def top_files(query, files, idfs, n):
    """
    Given a `query` (a set of words), `files` (a dictionary mapping names of
    files to a list of their words), and `idfs` (a dictionary mapping words
    to their IDF values), return a list of the filenames of the `n` top
    files that match the query, ranked according to tf-idf.
    """
    tmp = []
    for filename in files:
        # tf-idf score of this file with respect to the query
        score = 0
        for word in query:
            if word in idfs:
                score += compute_tf(word, files[filename]) * idfs[word]
        tmp.append((filename, score))

    # sort by score in descending order
    tmp.sort(key=lambda x: -x[1])
    return [filename for filename, score in tmp[:n]]
    
def compute_cnt(word, sentence):
    """Return the frequency of `word` in `sentence` (a list of words),
    divided by the sentence length: its contribution to query term density."""
    return sentence.count(word) / len(sentence)

# comparator for sorting in descending order: first by idf score,
# then by query term density to break ties
def cmp(a, b):
    if a[1] != b[1]:
        return b[1] - a[1]
    else:
        return b[2] - a[2]

def top_sentences(query, sentences, idfs, n):
    """
    Given a `query` (a set of words), `sentences` (a dictionary mapping
    sentences to a list of their words), and `idfs` (a dictionary mapping words
    to their IDF values), return a list of the `n` top sentences that match
    the query, ranked according to idf. If there are ties, preference should
    be given to sentences that have a higher query term density.
    """
    tmp = []
    for sentence in sentences:
        # s1: sum of idf values over query words appearing in the sentence
        # s2: query term density of the sentence
        s1 = 0
        s2 = 0
        for word in query:
            # guard against query words missing from `idfs`
            if word in sentences[sentence] and word in idfs:
                s1 += idfs[word]
            s2 += compute_cnt(word, sentences[sentence])
        tmp.append((sentence, s1, s2))

    tmp = sorted(tmp, key=functools.cmp_to_key(cmp))
    return [sentence for sentence, s1, s2 in tmp[:n]]
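
A hypothetical end-to-end sketch of the helpers above (the `corpus` directory name and the query are my examples, not from the assignment):

if __name__ == "__main__":
    files = load_files("corpus")  # a directory of .txt files
    file_words = {name: tokenize(text) for name, text in files.items()}
    file_idfs = compute_idfs(file_words)
    print(top_files({"machine", "learning"}, file_words, file_idfs, n=1))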